In [ ]:
import pandas as pd
data = pd.read_csv("../data/iris.data")
# convert to NumPy arrays because they are the easiest to handle in sklearn
variables = data.drop(["class"], axis=1).to_numpy()  # as_matrix() was removed in pandas 1.0
classes = data[["class"]].to_numpy().reshape(-1)
In [ ]:
# import cross-validation scorer and KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
train_X, test_X, train_Y, test_Y = train_test_split(variables, classes)
# initialize classifier object
classifier = KNeighborsClassifier()
# fit the object using training data and sample labels
classifier.fit(train_X, train_Y)
# evaluate the results for held-out test sample
classifier.score(test_X, test_Y)
# value is the mean accuracy
In [ ]:
# if we wanted to predict values for unseen data, we would use the predict()-method
classifier.predict(test_X) # note no known Y-values passed
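`score()` is just shorthand for predicting and averaging: it computes `predict()` on the given samples and returns the fraction that match the true labels. A small self-contained check (using scikit-learn's bundled iris data rather than the CSV above) that the two agree:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
tr_X, te_X, tr_Y, te_Y = train_test_split(X, y, random_state=0)
clf = KNeighborsClassifier().fit(tr_X, tr_Y)

# mean accuracy computed by hand from predict() ...
manual = np.mean(clf.predict(te_X) == te_Y)
# ... equals what score() reports
assert abs(clf.score(te_X, te_Y) - manual) < 1e-12
```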
In [ ]:
from sklearn.decomposition import PCA # PCA is a subspace method that projects the data into a lower-dimensional space
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
pca = PCA(n_components=2)
knn = KNeighborsClassifier(n_neighbors=3)
from sklearn.pipeline import Pipeline
pipeline = Pipeline([("pca", pca), ("kneighbors", knn)])
parameters_grid = dict(
pca__n_components=[1,2,3,4],
kneighbors__n_neighbors=[1,2,3,4,5,6]
)
grid_search = GridSearchCV(pipeline, parameters_grid)
grid_search.fit(train_X, train_Y)
grid_search.best_estimator_
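Besides `best_estimator_`, a fitted `GridSearchCV` exposes `best_params_` (the winning parameter combination) and `best_score_` (its mean cross-validated accuracy). A self-contained sketch, again using the bundled iris data instead of the CSV above:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
tr_X, te_X, tr_Y, te_Y = train_test_split(X, y, random_state=0)

pipeline = Pipeline([("pca", PCA()), ("kneighbors", KNeighborsClassifier())])
grid = GridSearchCV(pipeline, {"pca__n_components": [1, 2, 3],
                               "kneighbors__n_neighbors": [1, 3, 5]})
grid.fit(tr_X, tr_Y)

print(grid.best_params_)  # e.g. which n_components/n_neighbors pair won
print(grid.best_score_)   # its mean cross-validation accuracy
```

Note the naming convention: parameters are addressed as `<step name>__<parameter>`, with a double underscore separating the pipeline step from the parameter it sets.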
In [ ]:
# you can now test against the held-out part
grid_search.best_estimator_.score(test_X, test_Y)
There is another dataset, "breast-cancer-wisconsin.data". For a description, see [here](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/).
It contains samples with a patient ID (which you should remove), the measurements, and, as the last column, the doctor's judgment of the biopsy: malignant or benign.
Read in the file and create a classifier.
You can either just split the input and use some classifier, or do a grid-search cross-validation over a larger space of potential parameters.
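One gotcha worth knowing before you start: the file has no header row and marks missing measurements with `?`, which `read_csv` will not treat as NaN unless told to. A loading sketch, using a few inline rows in the file's format so it runs standalone (the column names here are our own invention, since the file itself names no columns):

```python
import io
import pandas as pd

# three illustrative rows in the shape of breast-cancer-wisconsin.data;
# the last one shows the "?" placeholder the file uses for missing values
sample = io.StringIO(
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1017122,8,10,10,8,7,10,9,7,1,4\n"
    "1057013,8,4,5,1,2,?,7,3,1,4\n"
)
columns = ["id", "clump_thickness", "cell_size", "cell_shape", "adhesion",
           "epithelial_size", "bare_nuclei", "chromatin", "nucleoli",
           "mitoses", "class"]
data = pd.read_csv(sample, header=None, names=columns, na_values="?")

# drop the patient ID, then drop rows with missing measurements
data = data.drop(["id"], axis=1).dropna()
variables = data.drop(["class"], axis=1).to_numpy()
classes = data[["class"]].to_numpy().reshape(-1)  # 2 = benign, 4 = malignant
```

To read the real file, replace `sample` with its path; from there the `train_test_split` or `GridSearchCV` workflow above applies unchanged.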
In [ ]: